Abstract

We investigate the learning rate of multiple kernel learning (MKL) with elastic-net regularization,
which consists of an $\ell_1$-regularizer for inducing sparsity and an $\ell_2$-regularizer
for controlling smoothness. We focus on a sparse setting where the total number of kernels
is large but the number of non-zero components of the ground truth is relatively small,
and prove that elastic-net MKL achieves the minimax learning rate on the $\ell_2$-mixed-norm
ball. Our bound is sharper than the previously known convergence rates, and has the property that
the smoother the truth is, the faster the convergence rate is.



1 Introduction 


Learning with kernels such as support vector machines has been demonstrated to be a
promising approach, given that kernels were chosen appropriately (Schölkopf and Smola, 2002;
Shawe-Taylor and Cristianini, 2004). So far, various strategies have been employed for choosing
appropriate kernels, ranging from simple cross-validation (Chapelle et al., 2002) to more
sophisticated 'kernel learning' approaches (Ong et al., 2005; Argyriou et al., 2006; Bach, 2009;
Cortes et al., 2009; Varma and Babu, 2009).

Multiple kernel learning (MKL) is one of the systematic approaches to learning kernels,
which tries to find the optimal linear combination of prefixed base-kernels by convex optimization
(Lanckriet et al., 2004). The seminal paper by Bach et al. (2004) showed that this
linear-combination MKL formulation can be interpreted as $\ell_1$-mixed-norm regularization (i.e., the
sum of the norms of the base kernels). Based on this interpretation, several variations of MKL were
proposed, and promising performance was achieved by 'intermediate' regularization strategies between
the sparse ($\ell_1$) and dense ($\ell_2$) regularizers, e.g., a mixture of $\ell_1$-mixed-norm and
$\ell_2$-mixed-norm called the elastic-net regularization (Shawe-Taylor, 2008; Tomioka and Suzuki, 2009)
and $\ell_p$-mixed-norm regularization with $1 < p < 2$ (Micchelli and Pontil, 2005; Kloft et al., 2009).

Together with the active development of practical MKL optimization algorithms, theoretical
analysis of MKL has also been extensively conducted. For $\ell_1$-mixed-norm MKL, Koltchinskii and Yuan
(2008) established the learning rate $d^{\frac{1-s}{1+s}} n^{-\frac{1}{1+s}} + d\log(M)/n$ under rather
restrictive conditions, where $n$ is the number of samples, $d$ is the number of non-zero components of
the ground truth, $M$ is the number of kernels, and $s$ ($0 < s < 1$) is a constant representing the
complexity of the reproducing kernel Hilbert spaces (RKHSs). Their conditions include a smoothness
assumption of the ground truth ($q = 1$ in our terminology (Assumption 2)). For elastic-net
regularization, Meier et al. (2009) gave a near optimal convergence rate
$d\,(n/\log(M))^{-\frac{1}{1+s}}$. Recently, Koltchinskii and Yuan (2010) showed that MKL with a variant
of $\ell_1$-mixed-norm regularization achieves the minimax optimal convergence rate, which has a
sharper dependency on $\log(M)$ than the bound of Meier et al. (2009), and established the bound
$d\,n^{-\frac{1}{1+s}} + d\log(M)/n$. Another line of research considers the cases where the ground
truth is not sparse, and bounds the Rademacher complexity of a candidate kernel class by a
pseudo-dimension of the kernel class (Srebro and Ben-David, 2006; Ying and Campbell, 2009;
Cortes et al., 2009; Kloft et al., 2010).

In this paper, we focus on the sparse setting (i.e., the total number of kernels is large, but the
number of non-zero components of the ground truth is relatively small), and derive a sharp learning
rate for elastic-net MKL. Our new learning rate,

$$d^{\frac{1+q}{1+q+s}}\, n^{-\frac{1+q}{1+q+s}}\, R_{2,g^*}^{\frac{2s}{1+q+s}} + \frac{d\log(M)}{n},$$

is faster than all the existing bounds, where $R_{2,g^*}$ is a kind of $\ell_2$-mixed-norm of the truth and
$q$ ($0 \le q \le 1$) is a constant depending on the smoothness of the ground truth.
Our contributions are summarized as follows. 

• The sharpest existing bound given by Koltchinskii and Yuan (2010) achieves the minimax rate
on the $\ell_\infty$-mixed-norm ball (Raskutti et al., 2009, 2010). Our work follows this line and shows
that the learning rate for elastic-net MKL further achieves the minimax rate on the
$\ell_2$-mixed-norm ball, which is faster than that on the $\ell_\infty$-mixed-norm ball. This result
implies that the bound by Koltchinskii and Yuan (2010) is tight only when the ground truth is evenly
spread in the non-zero components.

• We include the smoothness $q$ of the ground truth in our learning rate, where the ground
truth is said to be smooth if it is represented as a convolution of a certain function and an
integral kernel (see Assumption 2). Intuitively, for larger $q$, the truth is smoother. We show
that the smoother the truth is, the faster the convergence rate is. That is, the resultant
convergence rate becomes as if the complexity of the RKHSs were $\frac{s}{1+q}$ instead of the true
complexity $s$. Meier et al. (2009) and Koltchinskii and Yuan (2010) assumed $q = 0$, and Koltchinskii
and Yuan (2008) considered the situation $q = 1$. Our analysis covers those situations.

2 Preliminaries 

In this section, we formulate elastic-net MKL, and summarize mathematical tools that are needed 
for theoretical analysis. 

2.1 Formulation 

Suppose we are given $n$ samples $(x_i, y_i)_{i=1}^n$, where $x_i$ belongs to an input space
$\mathcal{X}$ and $y_i \in \mathbb{R}$. We denote the marginal distribution of $X$ by $\Pi$. We consider
an MKL regression problem in which the unknown target function is represented in the form
$f(x) = \sum_{m=1}^M f_m(x)$, where each $f_m$ belongs to a different RKHS $\mathcal{H}_m$
($m = 1, \dots, M$) with kernel $k_m$ over $\mathcal{X} \times \mathcal{X}$.

The elastic-net MKL we consider in this paper is the version considered in Meier et al. (2009):

$$\hat{f} = \mathop{\arg\min}_{f_m \in \mathcal{H}_m\ (m=1,\dots,M)} \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \sum_{m=1}^{M} f_m(x_i)\Big)^2 + \sum_{m=1}^{M}\Big(\lambda_1^{(n)}\sqrt{\|f_m\|_n^2 + \lambda_2^{(n)}\|f_m\|_{\mathcal{H}_m}^2} + \lambda_3^{(n)}\|f_m\|_{\mathcal{H}_m}^2\Big), \tag{1}$$

where $\|f_m\|_n := \sqrt{\frac{1}{n}\sum_{i=1}^n f_m(x_i)^2}$ and $\|f_m\|_{\mathcal{H}_m}$ is the RKHS norm
of $f_m$ in $\mathcal{H}_m$. The regularizer is the mixture of the $\ell_1$-term
$\sum_m \sqrt{\|f_m\|_n^2 + \lambda_2^{(n)}\|f_m\|_{\mathcal{H}_m}^2}$ and the $\ell_2$-term
$\sum_m \|f_m\|_{\mathcal{H}_m}^2$. In that sense, we say that the regularizer is of the elastic-net
type¹ (Zou and Hastie, 2005). Here the $\ell_1$-term is a mixture of the empirical $L_2$-norm
$\|f_m\|_n$ and the RKHS norm $\|f_m\|_{\mathcal{H}_m}$. Koltchinskii and Yuan (2010) also considered
$\ell_1$-regularization that is a mixture of these quantities:
$\sum_m \lambda_1^{(n)}\|f_m\|_n + \lambda_2^{(n)}\|f_m\|_{\mathcal{H}_m}$.

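By the representer theorem (Kimeldorf and Wahba, 1971), the solution $\hat{f}$ can be expressed as a
linear combination of $nM$ kernels: $\exists \alpha_{m,i} \in \mathbb{R}$,
$\hat{f}_m(x) = \sum_{i=1}^n \alpha_{m,i} k_m(x, x_i)$. Thus, using the Gram matrix
$K_m = (k_m(x_i, x_j))_{i,j}$, the regularizer in (1) is expressed as

$$\sum_{m=1}^{M}\left(\lambda_1^{(n)}\sqrt{\alpha_m^\top\Big(\frac{K_m K_m}{n} + \lambda_2^{(n)} K_m\Big)\alpha_m} + \lambda_3^{(n)}\,\alpha_m^\top K_m \alpha_m\right),$$

where $\alpha_m = (\alpha_{m,i})_{i=1}^n \in \mathbb{R}^n$. Thus, we can solve the problem by an SOCP
(second-order cone programming) solver as in Bach et al. (2004), or by coordinate descent algorithms
(Meier et al., 2009).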
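As a rough illustration of the Gram-matrix form of the regularizer, the following Python sketch evaluates the penalty for given coefficient vectors, using $\|f_m\|_n^2 = \alpha_m^\top K_m K_m \alpha_m / n$ and $\|f_m\|_{\mathcal{H}_m}^2 = \alpha_m^\top K_m \alpha_m$. The function name `elastic_net_mkl_regularizer` and its argument layout are assumptions made for this example only; the sketch evaluates the penalty and does not solve the optimization problem (1).

```python
import math


def mat_vec(K, a):
    """Multiply a square matrix K (list of rows) by a vector a."""
    return [sum(K[i][j] * a[j] for j in range(len(a))) for i in range(len(K))]


def elastic_net_mkl_regularizer(grams, alphas, lam1, lam2, lam3):
    """Evaluate sum_m [lam1 * sqrt(||f_m||_n^2 + lam2 * ||f_m||_H^2) + lam3 * ||f_m||_H^2].

    grams  : list of M Gram matrices K_m (each an n x n list of rows)
    alphas : list of M coefficient vectors alpha_m (each of length n)
    """
    total = 0.0
    for K, a in zip(grams, alphas):
        n = len(a)
        Ka = mat_vec(K, a)                       # fitted values f_m(x_i)
        emp_sq = sum(v * v for v in Ka) / n      # ||f_m||_n^2 = a^T K K a / n
        # ||f_m||_H^2 = a^T K a; clipped at 0 to guard against rounding error
        rkhs_sq = max(sum(ai * vi for ai, vi in zip(a, Ka)), 0.0)
        total += lam1 * math.sqrt(emp_sq + lam2 * rkhs_sq) + lam3 * rkhs_sq
    return total


# Toy usage with one kernel whose Gram matrix is the 2 x 2 identity.
K = [[1.0, 0.0], [0.0, 1.0]]
alpha = [3.0, 4.0]
penalty = elastic_net_mkl_regularizer([K], [alpha], lam1=1.0, lam2=0.5, lam3=0.1)
```

In this toy case $\|f_m\|_n^2 = 12.5$ and $\|f_m\|_{\mathcal{H}_m}^2 = 25$, so the $\ell_1$-part contributes $\sqrt{12.5 + 0.5 \cdot 25} = 5$ and the $\ell_2$-part contributes $0.1 \cdot 25 = 2.5$.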



¹ There is another version of MKL with elastic-net regularization considered in Shawe-Taylor (2008) and
Tomioka and Suzuki (2009), that is, $\lambda_1^{(n)}\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m} + \lambda_2^{(n)}\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^2$
(i.e., there is no $\|f_m\|_n$ term in the regularizer). However, we focus on the former one because the
latter one is too loose to properly bound the irrelevant components of the estimated function.

2.2 Notations and Assumptions

Here, we present several assumptions used in our theoretical analysis and prepare notations.

Let $\mathcal{H} = \mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_M$. We denote by $f^* \in \mathcal{H}$ the
ground truth satisfying the following assumption.

Assumption 1 (Basic Assumptions)

(A1-1) There exists $f^* = (f_1^*, \dots, f_M^*) \in \mathcal{H}$ such that
$E[Y|X] = \sum_{m=1}^M f_m^*(X)$, and the noise $\epsilon := Y - f^*(X)$ is bounded as $|\epsilon| \le L$.

(A1-2) For each $m = 1, \dots, M$, $\mathcal{H}_m$ is separable and
$\sup_{X \in \mathcal{X}} |k_m(X, X)| \le 1$.

The first assumption in (A1-1) ensures that the model $\mathcal{H}$ is correctly specified, and the
technical assumption $|\epsilon| \le L$ allows $\epsilon f$ to be Lipschitz continuous with respect to $f$.
These assumptions are not essential and can be relaxed to misspecified models and unbounded noise such
as Gaussian noise (Raskutti et al., 2010). However, for the sake of simplicity, we assume these
conditions. It is known that the assumption (A1-2) gives the following relation:

$$\|f_m\|_\infty = \sup_x |\langle k_m(x,\cdot), f_m\rangle_{\mathcal{H}_m}| \le \sup_x \|k_m(x,\cdot)\|_{\mathcal{H}_m}\|f_m\|_{\mathcal{H}_m} \le \sup_x \sqrt{k_m(x,x)}\,\|f_m\|_{\mathcal{H}_m} \le \|f_m\|_{\mathcal{H}_m}.$$

Later, we will also assume a stronger (but practical) condition on the sup-norm in Assumption 5.
We define an operator $T_m : \mathcal{H}_m \to \mathcal{H}_m$ as

$$\langle f_m, T_m g_m\rangle_{\mathcal{H}_m} := E[f_m(X) g_m(X)],$$

where $f_m, g_m \in \mathcal{H}_m$. Due to Mercer's theorem, there are an orthonormal system
$\{\phi_{k,m}\}_{k,m}$ in $L_2(\Pi)$ and the spectrum $\{\mu_{k,m}\}_{k,m}$ such that $k_m$ has the
following spectral representation:

$$k_m(x, x') = \sum_{k=1}^{\infty} \mu_{k,m}\,\phi_{k,m}(x)\,\phi_{k,m}(x'). \tag{2}$$

By this spectral representation, the inner product of the RKHS can be expressed as
$\langle f_m, g_m\rangle_{\mathcal{H}_m} = \sum_{k=1}^{\infty} \mu_{k,m}^{-1}\langle f_m, \phi_{k,m}\rangle_{L_2(\Pi)}\langle \phi_{k,m}, g_m\rangle_{L_2(\Pi)}$.

Assumption 2 (Convolution Assumption) There exist a real number $0 \le q \le 1$ and
$g_m^* \in \mathcal{H}_m$ such that

(A2) $f_m^*(x) = \int_{\mathcal{X}} k_m^{(q/2)}(x, x')\, g_m^*(x')\, d\Pi(x')$ ($\forall m = 1, \dots, M$),

where $k_m^{(q/2)}(x, x') = \sum_{k=1}^{\infty} \mu_{k,m}^{q/2}\,\phi_{k,m}(x)\,\phi_{k,m}(x')$. This is
equivalent to the following operator representation:

$$f_m^* = T_m^{q/2} g_m^*.$$

The constant $q$ controls the smoothness of the truth because $f_m^*$ is a convolution of the
integral kernel $k_m^{(q/2)}$ and $g_m^*$, and high frequency components are suppressed as $q$ becomes
large. Therefore, as $q$ becomes large, $f^*$ becomes "smooth". The assumption (A2) was considered in
Caponnetto and de Vito (2007) to analyze the convergence rate of least-squares estimators in a
single kernel setting. In MKL settings, Koltchinskii and Yuan (2008) showed a fast learning rate of
MKL, and Bach (2008) employed the assumption for $q = 1$ to show the consistency of MKL.
Proposition 9 of Bach (2008) gave a sufficient condition to fulfill (A2) with $q = 1$ for translation
invariant kernels $k_m(x, x') = h_m(x - x')$. Meier et al. (2009) considered a situation with $q = 0$
on Sobolev space; the analysis of Koltchinskii and Yuan (2010) also corresponds to $q = 0$. Note that
(A2) with $q = 0$ imposes nothing on the smoothness of the truth, and our analysis also covers this case.
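The damping effect of $q$ can be seen coordinate-wise: in the eigenbasis $\{\phi_{k,m}\}_k$, the operator $T_m^{q/2}$ multiplies the $k$-th $L_2(\Pi)$-coefficient of $g_m^*$ by $\mu_{k,m}^{q/2}$. The short Python sketch below applies this scaling to a finite list of coefficients. The function name `apply_fractional_operator` and the truncation to finitely many eigenvalues are assumptions made for illustration.

```python
def apply_fractional_operator(g_coefs, mu, q):
    """Return the L2 coefficients of f = T^{q/2} g in the kernel eigenbasis.

    g_coefs : coefficients <g, phi_k> of g (finite truncation, an assumption)
    mu      : eigenvalues mu_k of the kernel, same length as g_coefs
    q       : smoothness parameter with 0 <= q <= 1
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    return [m ** (q / 2.0) * g for m, g in zip(mu, g_coefs)]


# Toy spectrum that decays with k; g has equal weight on every component.
mu = [1.0, 0.25, 0.0625]
g = [1.0, 1.0, 1.0]
f_q0 = apply_fractional_operator(g, mu, q=0.0)  # no smoothing: f equals g
f_q1 = apply_fractional_operator(g, mu, q=1.0)  # components scaled by sqrt(mu_k)
```

With $q = 0$ the coefficients are unchanged, while with $q = 1$ the components belonging to small eigenvalues (high frequencies) are shrunk the most.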

We will show in Appendix A that as $q$ increases, the space of the functions that satisfy (A2)
becomes "simple". Thus, it might be natural to consider that, under the Convolution Assumption
(A2), the learning rate becomes faster as $q$ increases. Although this conjecture is actually true, it is
not obvious because the Convolution Assumption only restricts the ground truth, but not the search
space.

Next we introduce a parameter representing the complexity of RKHSs.

Assumption 3 (Spectral Assumption) There exist $0 < s < 1$ and $c$ such that

(A3) $\mu_{k,m} \le c\,k^{-\frac{1}{s}}$, ($1 \le \forall k$, $1 \le \forall m \le M$),

where $\{\mu_{k,m}\}_k$ is the spectrum of the kernel $k_m$ (see Eq. (2)).

It was shown that the spectral assumption (A3) is equivalent to the classical covering number
assumption² (Steinwart et al., 2009). If the spectral assumption (A3) holds, there exists a constant
$C$ that depends only on $s$ and $c$ such that

$$\mathcal{N}(\varepsilon, B_{\mathcal{H}_m}, L_2(\Pi)) \le C\,\varepsilon^{-2s}, \tag{3}$$

and the converse is also true (see Theorem 15 of Steinwart et al. (2009) and Steinwart (2008) for
details). Therefore, if $s$ is large, at least one of the RKHSs is "complex", and if $s$ is small, all
the RKHSs are "simple". A more detailed characterization of the covering number in terms of the
spectrum is provided in Appendix A. The covering number of the space of functions that satisfy the
Convolution Assumption (A2) is also provided there.

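We denote by $I_0$ the indices of truly active kernels, i.e.,

$$I_0 := \{m \mid \|f_m^*\|_{\mathcal{H}_m} > 0\}.$$

For $f = \sum_{m=1}^M f_m \in \mathcal{H}$ and a subset of indices $I \subseteq \{1, \dots, M\}$, we define
$\mathcal{H}_I = \oplus_{m \in I} \mathcal{H}_m$ and denote by $f_I \in \mathcal{H}_I$ the restriction of
$f$ to an index set $I$, i.e., $f_I = \sum_{m \in I} f_m$. For a given set of indices
$I \subseteq \{1, \dots, M\}$, let $\kappa(I)$ be defined as follows:

$$\kappa(I) := \sup\left\{\kappa \ge 0 \;\middle|\; \kappa \le \frac{\|\sum_{m \in I} f_m\|_{L_2(\Pi)}^2}{\sum_{m \in I}\|f_m\|_{L_2(\Pi)}^2},\ \forall f_m \in \mathcal{H}_m\ (m \in I)\right\}.$$

$\kappa(I)$ represents the correlation of RKHSs inside the indices $I$. Similarly, we define the canonical
correlations of RKHSs between $I$ and $I^c$ as follows:

$$\rho(I) := \sup\left\{\frac{\langle f_I, g_{I^c}\rangle_{L_2(\Pi)}}{\|f_I\|_{L_2(\Pi)}\|g_{I^c}\|_{L_2(\Pi)}} \;\middle|\; f_I \in \mathcal{H}_I,\ g_{I^c} \in \mathcal{H}_{I^c},\ f_I \ne 0,\ g_{I^c} \ne 0\right\}.$$

These quantities give a connection between the $L_2(\Pi)$-norm of $f \in \mathcal{H}$ and the
$L_2(\Pi)$-norm of $\{f_m\}_{m \in I}$ as shown in the following lemma. The proof is given in Appendix B.

Lemma 1 For all $I \subseteq \{1, \dots, M\}$, we have

$$\|f\|_{L_2(\Pi)}^2 \ge (1 - \rho(I)^2)\,\kappa(I)\left(\sum_{m \in I}\|f_m\|_{L_2(\Pi)}^2\right).$$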

We impose the following assumption for $\kappa(I_0)$ and $\rho(I_0)$.

Assumption 4 (Incoherence Assumption) For the truly active components $I_0$, $\kappa(I_0)$ is strictly
positive and $\rho(I_0)$ is strictly less than 1:

(A4) $0 < \kappa(I_0)(1 - \rho^2(I_0))$.

This condition is known as the incoherence condition (Koltchinskii and Yuan, 2008; Meier et al.,
2009), i.e., RKHSs are not too dependent on each other. In the theoretical analysis, we also obtain
an upper bound of the $L_2(\Pi)$-norm of $\hat{f} - f^*$ in terms of the $L_2(\Pi)$-norm of
$\{\hat{f}_m - f_m^*\}_{m \in I_0}$. Thus, by the incoherence condition and Lemma 1, we may focus on
bounding the $L_2(\Pi)$-norm of the "low-dimensional" components $\{\hat{f}_m - f_m^*\}_{m \in I_0}$,
instead of all the components. Koltchinskii and Yuan (2010) considered a weaker condition including
the restricted isometry (Candès and Tao, 2007) instead of (A4). Such a weaker condition is also
applicable to our analysis, but we employ (A4) for simplicity.

Finally we impose the following technical assumption related to the sup-norm of the members in
the RKHSs.

Assumption 5 (Sup-norm Assumption) Along with the Spectral Assumption (A3), there exists
a constant $C_1$ such that

(A5) $\|f_m\|_\infty \le C_1 \|f_m\|_{L_2(\Pi)}^{1-s}\|f_m\|_{\mathcal{H}_m}^{s}$ ($\forall f_m \in \mathcal{H}_m$, $m = 1, \dots, M$),

where $s$ is the exponent defined in the Spectral Assumption (A3).

² The $\varepsilon$-covering number $\mathcal{N}(\varepsilon, B_{\mathcal{H}_m}, L_2(\Pi))$ with respect to $L_2(\Pi)$ is the minimal number of balls with
radius $\varepsilon$ needed to cover the unit ball $B_{\mathcal{H}_m}$ in $\mathcal{H}_m$ (van der Vaart and Wellner, 1996).

This assumption is satisfied if the RKHS is a Sobolev space or is continuously embeddable in a
Sobolev space. For example, the RKHSs of Gaussian kernels are continuously embedded in all
Sobolev spaces, and thus satisfy the Sup-norm Assumption (A5). More generally, RKHSs with
$m$-times continuously differentiable kernels on a closed Euclidean ball in $\mathbb{R}^d$ are also
continuously embedded in a Sobolev space, and satisfy the Sup-norm Assumption with $s = \frac{d}{2m}$
(see Corollary 4.36 of Steinwart (2008)). Therefore this assumption is somewhat common for
practically used kernels. A more general necessary and sufficient condition in terms of real
interpolation is shown in Bennett and Sharpley (1988). Steinwart et al. (2009) used this assumption to
show the optimal rates for regularized regression using a single kernel function, and one can find
detailed discussions about the assumption there.

3 Convergence rate analysis 

In this section, we present our main result. 

3.1 The convergence rate of elastic-net MKL

Here we derive the learning rate of the estimator $\hat{f}$ defined by Eq. (1). We denote the number of
truly active components by $d := |I_0|$. We may suppose that the number of kernels $M$ and the number
of active kernels $d$ are increasing with respect to the number of samples $n$. Our main purpose in this
section is to show that the learning rate can be faster than the existing bounds. The existing bound
has already been shown to be optimal on the $\ell_\infty$-mixed-norm ball (Koltchinskii and Yuan, 2010;
Raskutti et al., 2010). Our claim is that the convergence rate can further achieve the minimax
optimal rate on the $\ell_2$-mixed-norm ball, which is faster than that on the $\ell_\infty$-mixed-norm ball.
Define $\eta(t)$ for $t > 0$ as

$$\eta(t) := \max(1, \sqrt{t}, t/\sqrt{n}).$$

For given $\lambda > 0$, we define $\xi_n$ as

$$\xi_n := \xi_n(\lambda) = \left(\frac{\lambda^{-\frac{s}{2}}}{\sqrt{n}} \vee \frac{\lambda^{-\frac{1}{2}}}{n^{\frac{1}{1+s}}} \vee \sqrt{\frac{\log(M)}{n}}\right). \tag{4}$$
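To make the two quantities above concrete, here is a minimal Python sketch of $\eta(t) = \max(1, \sqrt{t}, t/\sqrt{n})$ and of $\xi_n(\lambda)$, the maximum of $\lambda^{-s/2}/\sqrt{n}$, $\lambda^{-1/2}/n^{1/(1+s)}$ and $\sqrt{\log(M)/n}$. The function names `eta` and `xi_n` and the numerical values are assumptions chosen for illustration only.

```python
import math


def eta(t, n):
    """eta(t) = max(1, sqrt(t), t / sqrt(n)) for t > 0 and sample size n."""
    return max(1.0, math.sqrt(t), t / math.sqrt(n))


def xi_n(lam, n, M, s):
    """xi_n(lambda): the largest of the three terms in definition (4)."""
    return max(
        lam ** (-s / 2.0) / math.sqrt(n),          # lambda^{-s/2} / sqrt(n)
        lam ** (-0.5) / n ** (1.0 / (1.0 + s)),    # lambda^{-1/2} / n^{1/(1+s)}
        math.sqrt(math.log(M) / n),                # sqrt(log(M) / n)
    )


# Illustrative values (not taken from the paper).
eta_value = eta(t=4.0, n=100)
xi_value = xi_n(lam=1.0, n=100, M=10, s=0.5)
```

For these illustrative values the $\sqrt{\log(M)/n}$ term dominates $\xi_n$; for small $\lambda$ the first two terms take over.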

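Theorem 2 Suppose Assumptions 1–5 are satisfied, and let $\lambda > 0$ be an arbitrary positive number.
Then there exist universal constants $\tilde{C}_1, \tilde{C}_2$ and a constant $\psi_s$ depending on $s, c, L, C_1$ such that if
$\lambda_1^{(n)}$, $\lambda_2^{(n)}$ and $\lambda_3^{(n)}$ are set as $\lambda_1^{(n)} = \psi_s\,\eta(t)\,\xi_n(\lambda)$, $\lambda_2^{(n)} = \lambda$, $\lambda_3^{(n)} = \lambda$, then for all $n$ and $r$ ($> 0$)
satisfying $\frac{\log(M)}{\sqrt{n}} \le 1$ and the inequality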
Ci max(^^, r) (d + £^ =l \\g* m fn K 



(i - p(i ) 2 Mi ) 



< 1, 



have 



=i 



with probability 1 - exp(-t) - exp (- min | , ^pT)^ }) f or al1 1 - L 

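A proof of Theorem 2 is provided in Appendix D. The convergence rate (5) contains a tuning
parameter $\lambda$. Here we optimize this parameter. Let

$$R_{p,g^*} := \left(\sum_{m=1}^{M} \|g_m^*\|_{\mathcal{H}_m}^{p}\right)^{\frac{1}{p}},$$

and we assume that $R_{p,g^*}$ is strictly positive for all $p \ge 1$ ($R_{p,g^*} > 0$). If $n$ is sufficiently large
compared with $R_{2,g^*}$, the RHS of Eq. (5) is minimized by

$$\lambda = d^{\frac{1}{1+q+s}}\, n^{-\frac{1}{1+q+s}}\, R_{2,g^*}^{-\frac{2}{1+q+s}}$$

up to constants. Then the convergence rate (5) is reduced to

$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 \le C_1\left(d^{\frac{1+q}{1+q+s}} n^{-\frac{1+q}{1+q+s}} R_{2,g^*}^{\frac{2s}{1+q+s}} + d^{\frac{q+s}{1+q+s}} n^{-\frac{1+q}{1+q+s} - \frac{q(1-s)}{(1+s)(1+q+s)}} R_{2,g^*}^{\frac{2}{1+q+s}} + \frac{d\log(M)}{n}\right), \tag{6}$$

where $C_1$ is a constant. If $n^{\frac{q}{1+s}} \frac{d}{R_{2,g^*}^2} \ge C$ with a constant $C$ (this holds if
$\|g_m^*\|_{\mathcal{H}_m} \le 1/\sqrt{C}$ for all $m$), then Eq. (6) becomes

$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 \le C_2\left(d^{\frac{1+q}{1+q+s}} n^{-\frac{1+q}{1+q+s}} R_{2,g^*}^{\frac{2s}{1+q+s}} + \frac{d\log(M)}{n}\right), \tag{7}$$

where $C_2$ is a constant. We see that, as $q$ becomes large (the truth becomes smooth) or $s$ becomes
small (the RKHSs become simple), the convergence rate becomes faster when $R_{2,g^*} \ge 1$. In the next
subsection, we show that this bound (7) achieves the minimax optimal rate on the $\ell_2$-mixed-norm
ball.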
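The dependence of the rate on $q$ and $s$ can be checked numerically. The Python sketch below evaluates the rate $d^{\frac{1+q}{1+q+s}} n^{-\frac{1+q}{1+q+s}} R_{2,g^*}^{\frac{2s}{1+q+s}} + \frac{d\log(M)}{n}$ with all constants dropped. The function name `elastic_net_rate` and the parameter values are assumptions for illustration, so the numbers indicate trends only.

```python
import math


def elastic_net_rate(n, d, M, s, q, r2):
    """Leading-order rate of elastic-net MKL, up to constants.

    n: sample size, d: number of active kernels, M: number of kernels,
    s: complexity of the RKHSs (0 < s < 1), q: smoothness (0 <= q <= 1),
    r2: the l2-mixed-norm R_{2,g*} of the truth.
    """
    e = 1.0 + q + s
    leading = d ** ((1.0 + q) / e) * n ** (-(1.0 + q) / e) * r2 ** (2.0 * s / e)
    return leading + d * math.log(M) / n


# Illustrative setting (values are not from the paper).
base = dict(n=10_000, d=10, M=1_000, r2=1.0)
rate_rough = elastic_net_rate(s=0.5, q=0.0, **base)    # no smoothness assumed
rate_smooth = elastic_net_rate(s=0.5, q=1.0, **base)   # smooth truth
rate_simple = elastic_net_rate(s=0.2, q=0.0, **base)   # simpler RKHSs
```

In this setting the smooth case ($q = 1$) and the simple-RKHS case ($s = 0.2$) both give a smaller value than the baseline ($s = 0.5$, $q = 0$), in line with the discussion above.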

3.2 Minimax learning rate of $\ell_2$-mixed-norm ball

To derive the minimax rate, we slightly simplify the setup. First, we assume that the input space
$\mathcal{X}$ is expressed as $\mathcal{X} = \tilde{\mathcal{X}}^M$ for some space $\tilde{\mathcal{X}}$. Second, all the RKHSs
$\{\mathcal{H}_m\}_{m=1}^M$ are the same as an RKHS $\tilde{\mathcal{H}}$ defined on $\tilde{\mathcal{X}}$. Finally, we assume
that the marginal distribution $\Pi$ of the input is a product of a probability distribution $Q$, i.e.,
$\Pi = Q^M$. Thus, an input $x = (\tilde{x}^{(1)}, \dots, \tilde{x}^{(M)}) \in \mathcal{X} = \tilde{\mathcal{X}}^M$ is
a concatenation of $M$ random variables $\{\tilde{x}^{(m)}\}_{m=1}^M$ independently and identically
distributed from the distribution $Q$. Moreover, the function class $\mathcal{H}$ is a class of functions $f$
such that

$$f(x) = f(\tilde{x}^{(1)}, \dots, \tilde{x}^{(M)}) = \sum_{m=1}^{M} f_m(\tilde{x}^{(m)}),$$

where $f_m \in \tilde{\mathcal{H}}$ for all $m$. Without loss of generality, we may assume that all functions in
$\tilde{\mathcal{H}}$ are centered:

$$E_{\tilde{X} \sim Q}[f(\tilde{X})] = 0 \quad (\forall f \in \tilde{\mathcal{H}}).$$

We assume that the spectrum of the kernel $\tilde{k}$ corresponding to the RKHS $\tilde{\mathcal{H}}$ decays at the
rate of $-\frac{1}{s}$. That is, in addition to Assumption 3, we impose the following lower bound on the
spectrum: there exist $c', c$ ($> 0$) such that

$$c' k^{-\frac{1}{s}} \le \mu_k \le c\,k^{-\frac{1}{s}}, \tag{8}$$

where $\{\mu_k\}_k$ is the spectrum of the kernel $\tilde{k}$ (see Eq. (2)). We also assume that the noise
$\{\epsilon_i\}_{i=1}^n$ is generated by a Gaussian distribution with mean 0 and standard deviation $\sigma$.

Let $\mathcal{H}_{\ell_0}(d)$ be the set of functions with $d$ non-zero components in $\mathcal{H}$ defined by

$$\mathcal{H}_{\ell_0}(d) := \{(f_1, \dots, f_M) \in \mathcal{H} \mid |\{m \mid \|f_m\|_{\mathcal{H}_m} \ne 0\}| \le d\}.$$

We define the $\ell_p$-mixed-norm ball ($p \ge 1$) with radius $R$ in $\mathcal{H}_{\ell_0}(d)$ as

$$\mathcal{H}_{\ell_p}^{d,q}(R) := \left\{f = \sum_{m=1}^{M} f_m \;\middle|\; \exists (g_1, \dots, g_M) \in \mathcal{H}_{\ell_0}(d),\ f_m = T_m^{\frac{q}{2}} g_m,\ \Big(\sum_{m=1}^{M}\|g_m\|_{\mathcal{H}_m}^p\Big)^{\frac{1}{p}} \le R\right\}.$$

In Raskutti et al. (2010), the minimax learning rate on $\mathcal{H}_{\ell_\infty}^{d,0}(R)$ (i.e., $p = \infty$ and
$q = 0$) was derived³. We show (a lower bound of) the minimax learning rate for more general settings
($p = 2, \infty$ and $0 \le q \le 1$) in the following theorem.

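Theorem 3 Let $\tilde{s} = \frac{s}{1+q}$. Assume $d \le M/4$. Then the minimax learning rates are lower
bounded as follows. There exists a constant $\tilde{C}_1$ such that for $R_2 \ge \sqrt{\frac{d\log(M/d)}{n}}$,
the radius of the $\ell_2$-mixed-norm ball, we have

$$\inf_{\hat{f}} \sup_{f^* \in \mathcal{H}_{\ell_2}^{d,q}(R_2)} E\big[\|\hat{f} - f^*\|_{L_2(\Pi)}^2\big] \ge \tilde{C}_1\left(d^{\frac{1}{1+\tilde{s}}} n^{-\frac{1}{1+\tilde{s}}} R_2^{\frac{2\tilde{s}}{1+\tilde{s}}} + \frac{d\log(M/d)}{n}\right), \tag{9}$$

where 'inf' is taken over all measurable functions of the samples $(x_i, y_i)_{i=1}^n$ and the expectation
is taken with respect to the sample distribution. Similarly, we have the following minimax rate for
$p = \infty$:

$$\inf_{\hat{f}} \sup_{f^* \in \mathcal{H}_{\ell_\infty}^{d,q}(R_\infty)} E\big[\|\hat{f} - f^*\|_{L_2(\Pi)}^2\big] \ge \tilde{C}_1\left(d\,n^{-\frac{1}{1+\tilde{s}}} R_\infty^{\frac{2\tilde{s}}{1+\tilde{s}}} + \frac{d\log(M/d)}{n}\right), \tag{10}$$

for $R_\infty \ge \sqrt{\frac{\log(M/d)}{n}}$.

³ The set $\mathcal{F}_{M,d,\mathcal{H}}(R)$ in Raskutti et al. (2010) corresponds to $\mathcal{H}_{\ell_\infty}^{d,0}(R)$ in the current paper.

A proof of Theorem 3 is provided in Appendix E.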

Our learning rate (7) of elastic-net MKL achieves the minimax optimal rate (9) on the
$\ell_2$-mixed-norm ball if $M \gg d$. Moreover, the optimal rate (9) on the $\ell_2$-mixed-norm ball is always
faster than that of the $\ell_\infty$-mixed-norm ball (10). To see this, let
$R_{\infty,g^*} := \max_m \|g_m^*\|_{\mathcal{H}_m}$; then we always have $R_{2,g^*} \le \sqrt{d}\,R_{\infty,g^*}$ and
consequently we have

$$d^{\frac{1}{1+\tilde{s}}} n^{-\frac{1}{1+\tilde{s}}} R_{2,g^*}^{\frac{2\tilde{s}}{1+\tilde{s}}} \le d\,n^{-\frac{1}{1+\tilde{s}}} R_{\infty,g^*}^{\frac{2\tilde{s}}{1+\tilde{s}}}.$$

Now we consider two examples, "inhomogeneous setting" and "homogeneous setting", to compare
these two bounds:

1. $\|g_m^*\|_{\mathcal{H}_m} = m^{-1}$ ($\forall m \in I_0 = \{1, \dots, d\}$) (inhomogeneous setting): In this situation,
$R_{\infty,g^*} = 1$ and $R_{2,g^*}$ is bounded by a constant independent of $d$ (since $\sum_{m \ge 1} m^{-2} = \pi^2/6$).
Thus, the learning rate (7) of elastic-net MKL and the minimax rate on the $\ell_2$-mixed-norm ball are
$d^{\frac{1}{1+\tilde{s}}} n^{-\frac{1}{1+\tilde{s}}} + \frac{d\log(M)}{n}$, and that on the $\ell_\infty$-mixed-norm ball is
$d\,n^{-\frac{1}{1+\tilde{s}}} + \frac{d\log(M)}{n}$. Therefore, in the first term (the leading term with respect
to $n$), there is a difference of a $d^{\frac{\tilde{s}}{1+\tilde{s}}}$ factor. This difference could be $\sqrt{d}$ in the
worst case. Thus, a large discrepancy appears between the two rates in high-dimensional settings.

2. $\|g_m^*\|_{\mathcal{H}_m} = 1$ ($\forall m \in I_0$) (homogeneous setting): In this situation, $R_{\infty,g^*} = 1$ and
$R_{2,g^*} = \sqrt{d}$. Thus, all the bounds are $d\,n^{-\frac{1}{1+\tilde{s}}} + \frac{d\log(M)}{n}$. Here we observe
that the learning rate (7) of elastic-net MKL coincides with the minimax rate on the
$\ell_\infty$-mixed-norm ball. We also notice that the homogeneous setting is the only situation where those
two rates coincide with each other. As seen later, the existing bounds by previous works are the
minimax rate on the $\ell_\infty$-mixed-norm ball, thus are tight only in the homogeneous setting.
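The two settings above can be compared numerically. The following Python sketch computes $R_{2,g^*}$ and $R_{\infty,g^*}$ from a list of norms $\|g_m^*\|_{\mathcal{H}_m}$ and evaluates the leading terms of the $\ell_2$- and $\ell_\infty$-ball rates, with $\tilde{s} = s/(1+q)$ and all constants dropped. The function names `mixed_norms` and `leading_terms` and the chosen values of $n$, $d$, $s$, $q$ are assumptions for illustration.

```python
import math


def mixed_norms(g_norms):
    """Return (R_2, R_inf) for the list of norms ||g_m||_{H_m} of the active components."""
    r2 = math.sqrt(sum(v * v for v in g_norms))
    r_inf = max(g_norms)
    return r2, r_inf


def leading_terms(g_norms, n, s, q):
    """Leading terms (in n) of the l2-ball and l_inf-ball rates, up to constants."""
    d = len(g_norms)
    st = s / (1.0 + q)                      # effective complexity s~ = s / (1 + q)
    r2, r_inf = mixed_norms(g_norms)
    decay = n ** (-1.0 / (1.0 + st))
    l2_term = d ** (1.0 / (1.0 + st)) * decay * r2 ** (2.0 * st / (1.0 + st))
    linf_term = d * decay * r_inf ** (2.0 * st / (1.0 + st))
    return l2_term, linf_term


d = 100
homogeneous = [1.0] * d                              # ||g_m|| = 1 for all active m
inhomogeneous = [1.0 / m for m in range(1, d + 1)]   # ||g_m|| = 1 / m

hom_l2, hom_linf = leading_terms(homogeneous, n=10_000, s=0.5, q=0.0)
inh_l2, inh_linf = leading_terms(inhomogeneous, n=10_000, s=0.5, q=0.0)
```

In the homogeneous case the two leading terms agree up to rounding, while in the inhomogeneous case the $\ell_2$-ball term is smaller, by a factor close to $d^{\tilde{s}/(1+\tilde{s})}$.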

3.3 Comparison with existing bounds

Here we compare the existing bounds and the bound we derived. Roughly speaking, the difference
from the existing bounds is summarized in the following two points:

(a) Our learning rate achieves the minimax rate on the $\ell_2$-mixed-norm ball, instead of the
$\ell_\infty$-mixed-norm ball.

(b) Our bound includes the smoothing parameter $q$ (Assumption 2), and thus is more general and
faster than existing bounds.

The first bound on the convergence rate of MKL was derived by Koltchinskii and Yuan (2008),
which assumed $q = 1$ and $\frac{1}{d}\sum_{m \in I_0} \frac{\|g_m^*\|_{\mathcal{H}_m}^2}{\|f_m^*\|_{\mathcal{H}_m}^2} \le C$. Under
these rather strong conditions, they showed the bound
$d^{\frac{1-s}{1+s}} n^{-\frac{1}{1+s}} + \frac{d\log(M)}{n}$. For the smooth case $q = 1$, we obtained a faster rate
$n^{-\frac{2}{2+s}}$ instead of the $n^{-\frac{1}{1+s}}$ in their bound with respect to $n$.

The second bound was given by Meier et al. (2009), which showed
$d\left(\frac{\log(M)}{n}\right)^{\frac{1}{1+s}}$ for elastic-net regularization (1) under $q = 0$. Their bound
almost achieves the minimax rate on the $\ell_\infty$-mixed-norm ball except for the additional $\log(M)$
term. Compared with our bound, their bound has the additional $\log(M)$ factor, and its rate with
respect to $d$ is larger than the $d^{\frac{1}{1+s}}$ in our bound.

Most recently, Koltchinskii and Yuan (2010) presented the bound
$n^{-\frac{1}{1+s}}\big(d + \sum_{m \in I_0}\|f_m^*\|_{\mathcal{H}_m}\big) + \frac{d\log(M)}{n}$ for $q = 0$. Their bound is
exactly the minimax rate on the $\ell_\infty$-mixed-norm ball. However, their bound is $d^{\frac{s}{1+s}}$ times
slower than ours if the ground truth is inhomogeneous. For example, when
$\|f_m^*\|_{\mathcal{H}_m} = m^{-1}$ ($m \in I_0 = \{1, \dots, d\}$) and $f_m^* = 0$ (otherwise), their bound is
$n^{-\frac{1}{1+s}} d + \frac{d\log(M)}{n}$, while our bound is
$n^{-\frac{1}{1+s}} d^{\frac{1}{1+s}} + \frac{d\log(M)}{n}$.

All the bounds explained above focused on either $q = 0$ or $q = 1$. On the other hand, our analysis is
more general in that the whole range of $0 \le q \le 1$ can be accommodated.

The relation between our analysis and existing analyses is summarized in Table 1.

Table 1: Relation between our analysis and existing analyses.

4 Conclusion and Discussion

We presented a new learning rate of elastic-net MKL, which is faster than the existing bounds of
several MKL formulations. According to our bound, the learning rate of elastic-net MKL achieves
the minimax rate on the $\ell_2$-mixed-norm ball, instead of the $\ell_\infty$-mixed-norm ball. Our bound includes
a parameter $s$ representing the complexity of the RKHSs and another parameter $q$ controlling the
smoothness of the truth. Under a natural condition, the learning rate becomes faster as $s$ becomes
small or $q$ becomes large. Although the existing works concluded that MKL is optimal in the sense that
it achieves the minimax rate of the $\ell_\infty$-mixed-norm ball, we showed that elastic-net MKL further
achieves the minimax rate of the $\ell_2$-mixed-norm ball, which is faster than that of the
$\ell_\infty$-mixed-norm ball.

Koltchinskii and Yuan (2010) considered a variant of $\ell_1$ regularization:
$\sum_{m=1}^{M} \lambda_1^{(n)}\|f_m\|_n + \lambda_2^{(n)}\|f_m\|_{\mathcal{H}_m}$. They showed that MKL with that
regularization achieves the minimax rate of the $\ell_\infty$-mixed-norm ball. It might be interesting to
investigate whether that regularization also achieves the minimax rate of the $\ell_2$-mixed-norm ball or
another faster rate. In particular, it is interesting to study whether the smoothness
parameterization ($q$) gives a faster rate also for that $\ell_1$ regularization. If not, that might
explain the effectiveness of the elastic-net regularization in real data experiments.

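A Covering Number

Here, we give a detailed characterization of the covering number in terms of the spectrum using the
operator $T_m$. Accordingly, we give the complexity of the set of functions satisfying the Convolution
Assumption (Assumption 2). We extend the domain and the range of the operator $T_m$ to the whole
space of $L_2(\Pi)$, and define its power $T_m^{\beta} : L_2(\Pi) \to L_2(\Pi)$ for $\beta \in [0, 1]$ as

$$T_m^{\beta} f := \sum_{k=1}^{\infty} \mu_{k,m}^{\beta}\langle f, \phi_{k,m}\rangle_{L_2(\Pi)}\,\phi_{k,m}, \quad (f \in L_2(\Pi)).$$

Moreover, we define a Hilbert space $\mathcal{H}_{m,\beta}$ as

$$\mathcal{H}_{m,\beta} := \Big\{\sum_{k=1}^{\infty} b_k \phi_{k,m} \;\Big|\; \sum_{k=1}^{\infty} \mu_{k,m}^{-\beta} b_k^2 < \infty\Big\},$$

and equip this space with the Hilbert space norm
$\|\sum_{k=1}^{\infty} b_k \phi_{k,m}\|_{\mathcal{H}_{m,\beta}} := \sqrt{\sum_{k=1}^{\infty} \mu_{k,m}^{-\beta} b_k^2}$. One can
check that $\mathcal{H}_{m,1} = \mathcal{H}_m$. Here we define, for $R > 0$,

$$\mathcal{H}_m^q(R) := \{f_m = T_m^{\frac{q}{2}} g_m \mid g_m \in \mathcal{H}_m,\ \|g_m\|_{\mathcal{H}_m} \le R\}. \tag{11}$$

Then we obtain the following lemma.

Lemma 4 $\mathcal{H}_m^q(1)$ is equivalent to the unit ball of $\mathcal{H}_{m,1+q}$:
$\mathcal{H}_m^q(1) = \{f_m \in \mathcal{H}_{m,1+q} \mid \|f_m\|_{\mathcal{H}_{m,1+q}} \le 1\}$.

This can be shown as follows. For all $f_m \in \mathcal{H}_m^q(1)$, there exists $g_m \in \mathcal{H}_m$ such that
$f_m = T_m^{\frac{q}{2}} g_m$ and $\|g_m\|_{\mathcal{H}_m} \le 1$. Thus,
$g_m = (T_m^{\frac{q}{2}})^{-1} f_m = \sum_{k=1}^{\infty} \mu_{k,m}^{-\frac{q}{2}}\langle f_m, \phi_{k,m}\rangle_{L_2(\Pi)}\phi_{k,m}$ and
$1 \ge \|g_m\|_{\mathcal{H}_m}^2 = \sum_{k=1}^{\infty} \mu_{k,m}^{-1}\langle g_m, \phi_{k,m}\rangle_{L_2(\Pi)}^2 = \sum_{k=1}^{\infty} \mu_{k,m}^{-(1+q)}\langle f_m, \phi_{k,m}\rangle_{L_2(\Pi)}^2$.
Therefore, $f_m \in \mathcal{H}_m$ is in $\mathcal{H}_m^q(1)$ if and only if the norm of $f_m$ in
$\mathcal{H}_{m,1+q}$ is well-defined and not greater than 1.

Now Theorem 15 of Steinwart et al. (2009) gives an upper bound of the covering number of the
unit ball $B_{\mathcal{H}_{m,\beta}}$ in $\mathcal{H}_{m,\beta}$ as
$\mathcal{N}(\varepsilon, B_{\mathcal{H}_{m,\beta}}, L_2(\Pi)) \le C\,\varepsilon^{-\frac{2s}{\beta}}$, where $C$ is a constant
depending on $c, s, \beta$. This inequality with $\beta = 1$ corresponds to Eq. (3). Moreover, substituting
$\beta = 1 + q$ into the above equation, we have

$$\mathcal{N}(\varepsilon, \mathcal{H}_m^q(1), L_2(\Pi)) \le C\,\varepsilon^{-\frac{2s}{1+q}}. \tag{12}$$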
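The bound above says that the class $\mathcal{H}_m^q(1)$ behaves like a unit ball with effective complexity $s/(1+q)$ in place of $s$. As a small numerical illustration, the Python sketch below compares the exponents of the covering number bounds $C\varepsilon^{-2s}$ and $C\varepsilon^{-2s/(1+q)}$. The function names and the choice $C = 1$ are assumptions for this example, since the true constant is not specified here.

```python
import math


def effective_complexity(s, q):
    """Effective complexity s / (1 + q) of the class satisfying the Convolution Assumption."""
    return s / (1.0 + q)


def log_covering_bound(eps, s, q, C=1.0):
    """Logarithm of the bound C * eps^{-2 s / (1 + q)}; C = 1 is a placeholder constant."""
    return math.log(C) - 2.0 * effective_complexity(s, q) * math.log(eps)


# With s = 0.5, a smooth truth (q = 1) halves the exponent of the bound.
rough = log_covering_bound(eps=0.1, s=0.5, q=0.0)   # log of eps^{-1}
smooth = log_covering_bound(eps=0.1, s=0.5, q=1.0)  # log of eps^{-1/2}
```

For $\varepsilon < 1$ the bound for $q = 1$ is smaller than the one for $q = 0$, which matches the claim that the function class becomes "simple" as $q$ increases.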
B Proof of Lemma 1

Proof: (Lemma 1) For $J = I^c$, we have

$$\|f\|_{L_2(\Pi)}^2 = \|f_I\|_{L_2(\Pi)}^2 + 2\langle f_I, f_J\rangle_{L_2(\Pi)} + \|f_J\|_{L_2(\Pi)}^2 \ge \|f_I\|_{L_2(\Pi)}^2 - 2\rho(I)\|f_I\|_{L_2(\Pi)}\|f_J\|_{L_2(\Pi)} + \|f_J\|_{L_2(\Pi)}^2$$

$$\ge (1 - \rho(I)^2)\|f_I\|_{L_2(\Pi)}^2 \ge (1 - \rho(I)^2)\,\kappa(I)\left(\sum_{m \in I}\|f_m\|_{L_2(\Pi)}^2\right),$$

where we used the inequality of arithmetic and geometric means in the second inequality.



C Talagrand's Concentration Inequality

Proposition 5 (Talagrand's Concentration Inequality (Talagrand, 1996; Bousquet, 2002))
Let $\mathcal{G}$ be a function class on $\mathcal{X}$ that is separable with respect to the $\infty$-norm, and
$\{x_i\}_{i=1}^n$ be i.i.d. random variables with values in $\mathcal{X}$. Furthermore, let $B \ge 0$ and $U \ge 0$ be
$B := \sup_{g \in \mathcal{G}} E[(g - E[g])^2]$ and $U := \sup_{g \in \mathcal{G}} \|g\|_\infty$. Then there exists a universal
constant $K$ such that, for $Z := \sup_{g \in \mathcal{G}} \left|\frac{1}{n}\sum_{i=1}^n g(x_i) - E[g]\right|$, we have

$$P\left(Z \ge K\left[E[Z] + \sqrt{\frac{Bt}{n}} + \frac{Ut}{n}\right]\right) \le e^{-t}$$

for all $t > 0$.



D Proof of Theorem 2

For a Hilbert space $\mathcal{G} \subset L_2(P)$, let the $i$-th entropy number $e_i(\mathcal{G} \to L_2(P))$ be the infimum of
$\varepsilon > 0$ for which $\mathcal{N}(\varepsilon, B_{\mathcal{G}}, L_2(P)) \le 2^{i-1}$, where $B_{\mathcal{G}}$ is the unit ball of
$\mathcal{G}$. One can check that if the spectral assumption (A3) holds, the $i$-th entropy number is bounded as

$$e_i(\mathcal{H}_m \to L_2(\Pi)) \le \tilde{c}\,i^{-\frac{1}{2s}}, \tag{13}$$

where $\tilde{c}$ is a constant that depends on $s$ and $c$.

The following proposition is the key to the localization.

Proposition 6 Let $\mathcal{B}_{\sigma,a,b} \subset \mathcal{H}_m$ be a set such that
$\mathcal{B}_{\sigma,a,b} = \{f_m \in \mathcal{H}_m \mid \|f_m\|_{L_2(\Pi)} \le \sigma,\ \|f_m\|_{\mathcal{H}_m} \le a,\ \|f_m\|_\infty \le b\}$.
Assume the Spectral Assumption (A3). Then there exist constants $\tilde{c}_s, C'_s$ depending
only on $s$ and $c$ such that

$$E\left[\sup_{f_m \in \mathcal{B}_{\sigma,a,b}} \left|\frac{1}{n}\sum_{i=1}^{n}\sigma_i f_m(x_i)\right|\right] \le C'_s\left(\frac{\sigma^{1-s}(\tilde{c}_s a)^{s}}{\sqrt{n}} \vee (\tilde{c}_s a)^{\frac{2s}{1+s}}\, b^{\frac{1-s}{1+s}}\, n^{-\frac{1}{1+s}}\right).$$

Proof: (Proposition 6) Let $D_n$ be the empirical distribution: $D_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$. To bound
empirical processes, a bound of the entropy number with respect to the empirical $L_2$-norm is needed.
Corollary 7.31 of Steinwart (2008) gives the following upper bound: under the condition (13), there
exists a constant $\tilde{c}_s > 0$ only depending on $s$ such that

$$E_{D_n \sim \Pi^n}\big[e_i(\mathcal{H}_m \to L_2(D_n))\big] \le \tilde{c}_s\,\tilde{c}\,i^{-\frac{1}{2s}}.$$

Finally this and Theorem 7.16 of Steinwart (2008) give the assertion. ∎

Using the above proposition and the peeling device, we obtain the following lemma (see also
Meier et al. (2009)).

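Lemma 7 Under the Spectral Assumption (Assumption 3), there exists a constant $C_s$ depending
only on $s$ and $c$ such that for all $\lambda > 0$

$$E\left[\sup_{f_m \in \mathcal{H}_m : \|f_m\|_{\mathcal{H}_m} \le 1} \frac{\left|\frac{1}{n}\sum_{i=1}^{n}\sigma_i f_m(x_i)\right|}{\sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda}}\right] \le C_s\left(\frac{\lambda^{-\frac{s}{2}}}{\sqrt{n}} \vee \frac{1}{\lambda^{\frac{1}{2}} n^{\frac{1}{1+s}}}\right).$$

Proof: (Lemma 7) Let $\mathcal{H}_m(\sigma) := \{f_m \in \mathcal{H}_m \mid \|f_m\|_{\mathcal{H}_m} \le 1,\ \|f_m\|_{L_2(\Pi)} \le \sigma\}$ and $z = 2^{1/s} > 1$.
Then by noticing $\|f_m\|_\infty \le \|f_m\|_{\mathcal{H}_m}$, Proposition 6 gives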



E 



<E 



--C' 



<2C[ 



sup 

fm&n m :\\f m \\n m <l 



2 _1_ \ 
miiL 2 (n) ~ r 



sup 

/ m ew m (Ai/2) 



i^ Er=i^/m(^)i 

ll/™HL(n) + A 



sup 

fme-H m (z k \ 1 / 2 )\-H m (z k - 1 X 1 / 2 ) 



I n °~ifm( x i)\ 
Wfm\\ L2 (n) + A 



A— ~c s s 
X2 y/n 



E^ 

fc=0 



z fc(i-s)AVcf 



A~2 



n l+s 



E^ 

k=0 



A~3 



n l +3 



1 



<2CI 25 



1 — z s V // 

2 1/ S 



1-2;- 



2V»-l 

By setting C s <- 2C^ (25 s 1 2 



A"* 



2C: I 2c^/ 

n 



2 l / s .J^ / A-2 



we obtain the assertion. 



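The above lemma immediately gives the following corollary.

Corollary 8 Under the Spectral Assumption (Assumption 3), for all $\lambda > 0$

$$E\left[\sup_{f_m \in \mathcal{H}_m} \frac{\left|\frac{1}{n}\sum_{i=1}^{n}\sigma_i f_m(x_i)\right|}{\sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}}\right] \le C_s\left(\frac{\lambda^{-\frac{s}{2}}}{\sqrt{n}} \vee \frac{1}{\lambda^{\frac{1}{2}} n^{\frac{1}{1+s}}}\right),$$

where $C_s$ is the constant that appeared in the statement of Lemma 7, and we employed the convention
that $\frac{0}{0} = 0$.

Moreover we obtain the following corollary.

Corollary 9 Under the Spectral Assumption (Assumption 3), for all $\lambda > 0$

$$E\left[\sup_{f_m \in \mathcal{H}_m} \frac{\left|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i f_m(x_i)\right|}{\sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}}\right] \le 2C_s L\left(\frac{\lambda^{-\frac{s}{2}}}{\sqrt{n}} \vee \frac{1}{\lambda^{\frac{1}{2}} n^{\frac{1}{1+s}}}\right),$$

where $C_s$ is the constant that appeared in the statement of Lemma 7.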

Proof: (Corollary 9) Here we write $Pf = E[f]$ and $P_n f = \frac{1}{n}\sum_{i=1}^n f(x_i, y_i)$ for a function
$f$. Notice that $P\epsilon f_m = 0$, thus $\frac{1}{n}\sum_{i=1}^n \epsilon_i f_m(x_i) = (P_n - P)(\epsilon f_m)$. By the
symmetrization argument (van der Vaart and Wellner, 1996, Lemma 2.3.1) and the contraction inequality
(Ledoux and Talagrand, 1991, Theorem 4.12), we obtain



E 



sup 



\(P - P n )(ef m )\ 



U 112 



L 2 (n) + A ll/mllH„ 







=E 


sup 







(P-Pn) 



2 -A||/„ii 2 



<2E 



sup 

f m £H r . 



\J\Jm\\ ,__.(,, , r -v ., ,„ :| u 
— Ylri=l <7 i € ifm( x i) 



U 112 



AH/mll-^^ 



<2LE 



sup 



m iii 2 (n) 

n En=l °~ifm{Xi) 



ll/m||| 2 (n) + All/mil^ 






<2C,L 



A" f 



1 



fa \hn~^r> 



This gives the assertion. ■ 
From now on, we refer to $C_s$ as the constant appearing in the statement of Lemma 7. We define $\phi_s$ as

$$
\phi_s = 2KL(C_s + 1 + C_1).
$$

Recalling the definition of $\xi_n$, we obtain the following theorem.

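Theorem 10 Under the Basic Assumption, the Spectral Assumption and the Supnorm Assumption, when $\frac{\log(M)}{\sqrt{n}} \le 1$, we have for all $\lambda > 0$ and all $t \ge 1$

$$
\left|\frac{1}{n}\sum_{i=1}^n \epsilon_i f_m(x_i)\right| \le \phi_s \xi_n(\lambda) \sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}\, \max\left(1, \sqrt{t}, \frac{t}{\sqrt{n}}\right)
\quad (\forall f_m \in \mathcal{H}_m, \forall m = 1, \dots, M),
$$

with probability $1 - \exp(-t)$. Moreover we also have

$$
\mathrm{E}\left[\max_m \sup_{f_m \in \mathcal{H}_m} \frac{\left|\frac{1}{n}\sum_{i=1}^n \epsilon_i f_m(x_i)\right|}{\sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}}\right] \le 4\phi_s\xi_n(\lambda).
$$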

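Proof: (Theorem 10) Since

$$
\frac{\|f_m\|_{L_2(\Pi)}}{\sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}} \le 1,
$$

$$
\frac{\|f_m\|_\infty}{\sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}}
\le \frac{C_1\|f_m\|_{L_2(\Pi)}^{1-s}\|f_m\|_{\mathcal{H}_m}^{s}}{\sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}}
\overset{\text{Young}}{\le} \frac{C_1\lambda^{-s/2}\sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}}{\sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}}
\le C_1\lambda^{-s/2},
$$

applying Talagrand's concentration inequality (Proposition 5), we obtain

$$
P\left(\sup_{f_m \in \mathcal{H}_m} \frac{\left|\frac{1}{n}\sum_{i=1}^n \epsilon_i f_m(x_i)\right|}{\sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}}
\ge K\left[2C_sL\xi_n + \sqrt{\frac{L^2 t}{n}} + \frac{C_1L\lambda^{-s/2}t}{n}\right]\right) \le e^{-t}.
$$

Therefore the uniform bound over all $m = 1, \dots, M$ is given as

$$
\begin{aligned}
&P\left(\max_m \sup_{f_m \in \mathcal{H}_m} \frac{\left|\frac{1}{n}\sum_{i=1}^n \epsilon_i f_m(x_i)\right|}{\sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}}
\ge K\left[2C_sL\xi_n + \sqrt{\frac{L^2 t}{n}} + \frac{C_1L\lambda^{-s/2}t}{n}\right]\right) \\
&\quad\le \sum_{m=1}^M P\left(\sup_{f_m \in \mathcal{H}_m} \frac{\left|\frac{1}{n}\sum_{i=1}^n \epsilon_i f_m(x_i)\right|}{\sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}}
\ge K\left[2C_sL\xi_n + \sqrt{\frac{L^2 t}{n}} + \frac{C_1L\lambda^{-s/2}t}{n}\right]\right)
\le Me^{-t}.
\end{aligned}
$$

Setting $t \leftarrow t + \log(M)$, we have

$$
P\left(\max_m \sup_{f_m \in \mathcal{H}_m} \frac{\left|\frac{1}{n}\sum_{i=1}^n \epsilon_i f_m(x_i)\right|}{\sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}}
\ge K\left[2C_sL\xi_n + \sqrt{\frac{L^2(t + \log(M))}{n}} + \frac{C_1L\lambda^{-s/2}(t + \log(M))}{n}\right]\right) \le e^{-t}. \quad (14)
$$

Now

$$
\begin{aligned}
\sqrt{\frac{L^2(t + \log(M))}{n}} + \frac{C_1L\lambda^{-s/2}(t + \log(M))}{n}
&\le L\sqrt{\frac{t}{n}} + L\sqrt{\frac{\log(M)}{n}} + \frac{C_1L\lambda^{-s/2}}{\sqrt{n}}\left(\frac{t}{\sqrt{n}} + \frac{\log(M)}{\sqrt{n}}\right) \\
&\le \xi_n\left(L\sqrt{t} + L + C_1L\frac{t}{\sqrt{n}} + C_1L\right)
\le \xi_n(2L + 2C_1L)\eta(t),
\end{aligned}
$$

where we used $\frac{\log(M)}{\sqrt{n}} \le 1$ in the second inequality. Thus Eq. (14) implies

$$
P\left(\max_m \sup_{f_m \in \mathcal{H}_m} \frac{\left|\frac{1}{n}\sum_{i=1}^n \epsilon_i f_m(x_i)\right|}{\sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}}
\ge K(2C_sL + 2L + 2C_1L)\xi_n\eta(t)\right) \le e^{-t}.
$$

By substituting $\phi_s = 2KL(C_s + 1 + C_1)$, we obtain

$$
P\left(\max_m \sup_{f_m \in \mathcal{H}_m} \frac{\left|\frac{1}{n}\sum_{i=1}^n \epsilon_i f_m(x_i)\right|}{\sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}}
\ge \phi_s\xi_n\eta(t)\right) \le e^{-t}, \quad (15)
$$

which gives the first assertion.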
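The rescaling $t \leftarrow t + \log(M)$ used above absorbs the factor $M$ from the union bound, because $M e^{-(t + \log M)} = e^{-t}$. A minimal numerical sketch follows; the function name and the sample values are illustrative assumptions.

```python
import math


def union_bound_tail(num_kernels: int, t: float) -> float:
    """Union bound M * exp(-t') evaluated at the inflated level t' = t + log(M)."""
    inflated_t = t + math.log(num_kernels)
    return num_kernels * math.exp(-inflated_t)


if __name__ == "__main__":
    # Hypothetical numbers of kernels; the tail stays at exp(-t) for each of them.
    for num_kernels in (10, 1000, 100000):
        print(num_kernels, union_bound_tail(num_kernels, 2.0), math.exp(-2.0))
```

The price of uniformity over the $M$ kernels is therefore only the additive $\log(M)$ inside the deviation terms, which is where the $\sqrt{\log(M)/n}$ contribution comes from.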

Next we show the second assertion. Eq. (fT5|) implies that 



E 



max sup 



H/ m llL 2 (n) + ^ll/™ll«„ 



4=0 



(15) 



where we used $\eta(t + 1) = \max\{1, \sqrt{t + 1}, (t + 1)/\sqrt{n}\} \le t + 1$ in the second inequality. Thus we obtain the assertion. ■
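The inequality $\eta(t + 1) \le t + 1$ can be checked numerically, together with the series $\sum_{t \ge 0} (t + 1)e^{-t}$ that appears when a tail bound of the form $e^{-t}$ is summed against $\eta(t + 1)$. The sketch below is only a sanity check under that assumed form of the summation; it is not a transcript of the omitted steps, and the helper names are hypothetical.

```python
import math


def eta(t: float, n: int) -> float:
    """eta(t) = max(1, sqrt(t), t / sqrt(n))."""
    return max(1.0, math.sqrt(t), t / math.sqrt(n))


def tail_series(num_terms: int = 200) -> float:
    """Partial sum of sum_{t >= 0} (t + 1) * exp(-t)."""
    return sum((t + 1) * math.exp(-t) for t in range(num_terms))


if __name__ == "__main__":
    n = 50  # hypothetical sample size
    print(all(eta(t + 1, n) <= t + 1 for t in range(1000)))
    print(tail_series(), 1.0 / (1.0 - math.exp(-1.0)) ** 2)
```

The series has the closed form $1/(1 - e^{-1})^2 \approx 2.5$, so adding the leading term gives a value below 4, which is compatible with the constant in the second assertion of Theorem 10.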

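Moreover we obtain the following bound for the difference between the empirical and the expected $L_2$-norm. Let $\phi'_s$ be

$$
\phi'_s = K\left[16KC_1(C_s + 1 + C_1) + C_1 + C_1^2\right].
$$

We define $\zeta_n(r, \lambda)$ as

$$
\zeta_n(r, \lambda) := \min\left(\frac{r^2\log(M)}{n\,\xi_n^4(\lambda)\,\phi_s'^2}, \frac{r}{\xi_n^2(\lambda)\,\phi'_s}\right).
$$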

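Theorem 11 Under the Spectral Assumption and the Supnorm Assumption, when $\frac{\log(M)}{\sqrt{n}} \le 1$, for all $\lambda > 0$ we have

$$
\left|\Big\|\sum_{m=1}^M f_m\Big\|_n^2 - \Big\|\sum_{m=1}^M f_m\Big\|_{L_2(\Pi)}^2\right|
\le \max\left(\phi'_s\sqrt{n}\,\xi_n^2(\lambda), r\right)\left(\sum_{m=1}^M \sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}\right)^2
\quad (\forall f_m \in \mathcal{H}_m\ (m = 1, \dots, M)),
$$

with probability $1 - \exp(-\zeta_n(r, \lambda))$.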
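Proof: (Theorem 11)

$$
\begin{aligned}
&\mathrm{E}\left[\sup_{f_m \in \mathcal{H}_m} \frac{\left|\big\|\sum_{m=1}^M f_m\big\|_n^2 - \big\|\sum_{m=1}^M f_m\big\|_{L_2(\Pi)}^2\right|}{\left(\sum_{m=1}^M \sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}\right)^2}\right] \\
&\quad\le 2\mathrm{E}\left[\sup_{f_m \in \mathcal{H}_m} \frac{\left|\frac{1}{n}\sum_{i=1}^n \sigma_i\left(\sum_{m=1}^M f_m(x_i)\right)^2\right|}{\left(\sum_{m=1}^M \sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}\right)^2}\right] \\
&\quad\le \sup_{f_m \in \mathcal{H}_m} \frac{\big\|\sum_{m=1}^M f_m\big\|_\infty}{\sum_{m=1}^M \sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}}
\times 2\mathrm{E}\left[\sup_{f_m \in \mathcal{H}_m} \frac{\left|\frac{1}{n}\sum_{i=1}^n \sigma_i\left(\sum_{m=1}^M f_m(x_i)\right)\right|}{\sum_{m=1}^M \sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}}\right], \quad (16)
\end{aligned}
$$

where we used the contraction inequality in the last line (Ledoux and Talagrand, 1991, Theorem 4.12). Here we notice that

$$
\Big\|\sum_{m=1}^M f_m\Big\|_\infty
\le \sum_{m=1}^M \|f_m\|_\infty
\le \sum_{m=1}^M C_1\|f_m\|_{L_2(\Pi)}^{1-s}\|f_m\|_{\mathcal{H}_m}^{s}
\le \sum_{m=1}^M C_1\lambda^{-s/2}\sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2},
$$

where we used Young's inequality $a^{1-s}b^{s} \le (1 - s)a + sb$ in the second line. Thus the RHS of the inequality (16) can be upper bounded by

$$
\begin{aligned}
&2C_1\lambda^{-s/2}\,\mathrm{E}\left[\sup_{f_m \in \mathcal{H}_m} \frac{\left|\frac{1}{n}\sum_{i=1}^n \sigma_i\left(\sum_{m=1}^M f_m(x_i)\right)\right|}{\sum_{m=1}^M \sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}}\right] \\
&\quad\le 2C_1\lambda^{-s/2}\,\mathrm{E}\left[\sup_{f_m \in \mathcal{H}_m} \max_m \frac{\left|\frac{1}{n}\sum_{i=1}^n \sigma_i f_m(x_i)\right|}{\sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}}\right],
\end{aligned}
$$

where we used the relation $\frac{\sum_m a_m}{\sum_m b_m} \le \max_m\left(\frac{a_m}{b_m}\right)$ for all $a_m \ge 0$ and $b_m \ge 0$ with the convention $\frac{0}{0} = 0$.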
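The two elementary inequalities used in this step, Young's inequality $a^{1-s}b^{s} \le (1 - s)a + sb$ and the ratio bound $\sum_m a_m / \sum_m b_m \le \max_m (a_m / b_m)$, can be spot-checked on random inputs. The sketch below is a sanity check only; the helper names are hypothetical, and the $b_m$ are drawn strictly positive so that the $0/0$ convention is not needed.

```python
import random


def young_gap(a: float, b: float, s: float) -> float:
    """(1 - s) * a + s * b - a^(1-s) * b^s, which should be non-negative."""
    return (1.0 - s) * a + s * b - (a ** (1.0 - s)) * (b ** s)


def ratio_of_sums(a: list[float], b: list[float]) -> float:
    """sum(a) / sum(b) for non-negative a and strictly positive b."""
    return sum(a) / sum(b)


def max_of_ratios(a: list[float], b: list[float]) -> float:
    """max over m of a_m / b_m."""
    return max(x / y for x, y in zip(a, b))


if __name__ == "__main__":
    rng = random.Random(0)
    a = [rng.uniform(0.0, 5.0) for _ in range(8)]
    b = [rng.uniform(0.1, 5.0) for _ in range(8)]
    print(young_gap(2.0, 3.0, 0.4) >= 0.0)
    print(ratio_of_sums(a, b) <= max_of_ratios(a, b))
```

The ratio bound holds because $a_m \le b_m \max_{m'} (a_{m'}/b_{m'})$ for every $m$; summing over $m$ and dividing by $\sum_m b_m$ gives the claim.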
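Therefore, by $\frac{\log(M)}{\sqrt{n}} \le 1$ and Theorem 10 where $\sigma_i$ is substituted into $\epsilon_i$, the right hand side is upper bounded by $16KC_1(C_s + 1 + C_1)\lambda^{-s/2}\xi_n$. Here we again apply Talagrand's concentration inequality, then we have

$$
P\left(\sup_{f_m \in \mathcal{H}_m} \frac{\left|\big\|\sum_{m=1}^M f_m\big\|_n^2 - \big\|\sum_{m=1}^M f_m\big\|_{L_2(\Pi)}^2\right|}{\left(\sum_{m=1}^M \sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}\right)^2}
\ge K\left[16KC_1(C_s + 1 + C_1)\lambda^{-s/2}\xi_n + C_1\lambda^{-s/2}\sqrt{\frac{t}{n}} + \frac{C_1^2\lambda^{-s}t}{n}\right]\right) \le e^{-t}, \quad (17)
$$

where we substituted the following upper bounds of $B$ and $U$:

$$
\begin{aligned}
B &= \sup_{f_m \in \mathcal{H}_m} \mathrm{E}\left[\left(\frac{\left(\sum_{m=1}^M f_m\right)^2}{\left(\sum_{m=1}^M \sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}\right)^2}\right)^2\right] \\
&\le \sup_{f_m \in \mathcal{H}_m} \frac{\mathrm{E}\left[\left(\sum_{m=1}^M f_m\right)^2\right]\big\|\sum_{m=1}^M f_m\big\|_\infty^2}{\left(\sum_{m=1}^M \sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}\right)^4} \\
&\le \sup_{f_m \in \mathcal{H}_m} \frac{\left(\sum_{m=1}^M \|f_m\|_{L_2(\Pi)}\right)^2\left(\sum_{m=1}^M C_1\lambda^{-s/2}\sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}\right)^2}{\left(\sum_{m=1}^M \sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}\right)^4} \\
&\le C_1^2\lambda^{-s},
\end{aligned}
$$

where in the second inequality we used the relation $\mathrm{E}\left[\left(\sum_{m=1}^M f_m\right)^2\right] = \mathrm{E}\left[\sum_{m,m'=1}^M f_m f_{m'}\right] \le \sum_{m,m'=1}^M \|f_m\|_{L_2(\Pi)}\|f_{m'}\|_{L_2(\Pi)} = \left(\sum_{m=1}^M \|f_m\|_{L_2(\Pi)}\right)^2$, and

$$
U = \sup_{f_m \in \mathcal{H}_m} \left\|\frac{\left(\sum_{m=1}^M f_m\right)^2}{\left(\sum_{m=1}^M \sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}\right)^2}\right\|_\infty
\le \sup_{f_m \in \mathcal{H}_m} \frac{\left(\sum_{m=1}^M C_1\lambda^{-s/2}\sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}\right)^2}{\left(\sum_{m=1}^M \sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}\right)^2}
\le C_1^2\lambda^{-s}.
$$

Now notice that

$$
\begin{aligned}
&K\left[16KC_1(C_s + 1 + C_1)\lambda^{-s/2}\xi_n + C_1\lambda^{-s/2}\sqrt{\frac{t}{n}} + \frac{C_1^2\lambda^{-s}t}{n}\right] \\
&\quad\le \sqrt{n}K\left[16KC_1(C_s + 1 + C_1)\frac{\lambda^{-s/2}}{\sqrt{n}}\xi_n + C_1\frac{\lambda^{-s/2}}{\sqrt{n}}\sqrt{\frac{\log(M)}{n}}\sqrt{\frac{t}{\log(M)}} + C_1^2\frac{\lambda^{-s}}{n}\frac{t}{\sqrt{n}}\right] \\
&\quad\le \sqrt{n}K\left[16KC_1(C_s + 1 + C_1) + C_1\sqrt{\frac{t}{\log(M)}} + C_1^2\frac{t}{\sqrt{n}}\right]\xi_n^2.
\end{aligned}
$$

Therefore Eq. (17) gives the following inequality

$$
\sup_{f_m \in \mathcal{H}_m} \frac{\left|\big\|\sum_{m=1}^M f_m\big\|_n^2 - \big\|\sum_{m=1}^M f_m\big\|_{L_2(\Pi)}^2\right|}{\left(\sum_{m=1}^M \sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}\right)^2}
\le K\left[16KC_1(C_s + 1 + C_1) + C_1 + C_1^2\right]\sqrt{n}\,\xi_n^2\max\left(1, \sqrt{t/\log(M)}, t/\sqrt{n}\right)
$$

with probability $1 - \exp(-t)$. By substituting $\phi'_s = K\left[16KC_1(C_s + 1 + C_1) + C_1 + C_1^2\right]$ and $t = \zeta_n(r, \lambda)$, we have

$$
\begin{aligned}
\sup_{f_m \in \mathcal{H}_m} \frac{\left|\big\|\sum_{m=1}^M f_m\big\|_n^2 - \big\|\sum_{m=1}^M f_m\big\|_{L_2(\Pi)}^2\right|}{\left(\sum_{m=1}^M \sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}\right)^2}
&\le \phi'_s\sqrt{n}\,\xi_n^2\max\left(1, \sqrt{\frac{\zeta_n(r, \lambda)}{\log(M)}}, \frac{\zeta_n(r, \lambda)}{\sqrt{n}}\right) \\
&\le \phi'_s\sqrt{n}\,\xi_n^2\max\left(1, \frac{r}{\phi'_s\sqrt{n}\,\xi_n^2}\right)
\le \max\left(\phi'_s\sqrt{n}\,\xi_n^2, r\right)
\end{aligned}
$$

with probability $1 - \exp(-\zeta_n(r, \lambda))$.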



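Now we redefine $\phi_s$ as

$$
\phi_s := \max\left(K\left[16KC_1(C_s + 1 + C_1) + C_1 + C_1^2\right],\ 2KL(C_s + 1 + C_1),\ 1\right),
$$

where $K$ is the universal constant appearing in Talagrand's concentration inequality (Proposition 5). We define events $\mathcal{E}_1(t)$ and $\mathcal{E}_2(r)$ as

$$
\mathcal{E}_1(t) = \left\{\left|\frac{1}{n}\sum_{i=1}^n \epsilon_i f_m(x_i)\right| \le \eta(t)\phi_s\xi_n\sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2},\ \forall f_m \in \mathcal{H}_m, \forall m = 1, \dots, M\right\},
$$

$$
\mathcal{E}_2(r) = \left\{\left|\Big\|\sum_{m=1}^M f_m\Big\|_n^2 - \Big\|\sum_{m=1}^M f_m\Big\|_{L_2(\Pi)}^2\right| \le \max\left(\phi_s\sqrt{n}\,\xi_n^2, r\right)\left(\sum_{m=1}^M \sqrt{\|f_m\|_{L_2(\Pi)}^2 + \lambda\|f_m\|_{\mathcal{H}_m}^2}\right)^2,\ \forall f_m \in \mathcal{H}_m, \forall m = 1, \dots, M\right\}.
$$

Theorems 10 and 11 give that $P(\mathcal{E}_1(t)) \ge 1 - e^{-t}$ and $P(\mathcal{E}_2(r)) \ge 1 - \exp(-\zeta_n(r, \lambda))$ under some conditions.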

The next lemma gives a bound of irrelevant components ($m \in I_0^c$) of $\hat{f}$ in terms of the relevant components.
Lemma 12 Set Aj" = 4<f> s r](t)^ n (X), A2 = A, A3 = A /or arbitrary A > 0. Then for all n and 
r(> 0) such that ~/=^ < 1 max (0s%AiC 2 (A), r) < |, we ftasue 



M 



^ yll/m /m|ll 2 (n)+^2 'll/m ~ /mll«„ 



I 1 

ni6/o 



a£° 2 I 



v(») 



ll/m /™lli 2 (n) +^2 ^ll/m /™ll« m i ' 



(18) 



with probability $1 - \exp(-t) - \exp(-\zeta_n(r, \lambda))$.
Proof: (Lemma 12) On the event $\mathcal{E}_2(r)$, for all $f_m \in \mathcal{H}_m$ we obtain the upper bound of the regularization term as



\\fmP n + X^\\fm\\ 2 n 



< yj H/mlli 2( n) + max(0 s ^a(A),r)(||/ m ||i 2(n) + X\\f m \\ 2 n J + ^WUl^ 

< ^(ll/-lli 2 (n) + 4 n) ll/ m |lli m )> (19) 
because max(</> s \fn£^ l (A) , r) < | and A = X^ . On the other hand, we also obtain a lower bound as 



\\fm\\l + X^\\f m \\l m 

> ^\\fm\\l 2{n) ~ max(^^ (A))r)( || /m ||2 3(n) + X \\f m f n J + A^||/ m ||^ m 

> ^(ll/mllL(n)+ A 2 n) |l/m||^ m ), (20) 

for all $f_m \in \mathcal{H}_m$.

Note that, since $\hat{f}$ minimizes the objective function,

M 



ll/-/*l£+E( A i Vll/™ll« + A 2 \\U\ 2 n m + *r\\UnJ 

in—1 

1 n M 



n=lm=l m£l 

This implies 



H/-/X+ E A^Viiz-iin+^n/™!!^ 

m€/ c 

<- E E 6 i(fm( x i) ~ fm( X i)) 



n— 1 m— 1 



+ E ( A i V "ft ^ /-H» + A 2 'lift - /-IIL + A 3 (ll/mllw™ - II/-IID)- 
ra£7j 

Thus on the event £ x (t) and £ 2 (r), by Eq. (JT5J) and Eq. ([2D)), we have 

n/-ri£+5 E A ^Vii/™iiL(n)+ A 2 n) n/-ii^ 



2 



< E <lV)<l>'Zny/\\fm- f*n\\l 2{n) + X [ 2 n) \\f m - f m \\ 2 nm + 



rn—l 



E(^ A i"VHft- /-Hi 2 (n)+4 n) ||/^- /™|I!< TO + Ai n) (2(/* X- ll/m-/4llLH 21 ) 



4 E A^'^/ll/mlli^n) + A 2 ™ ) ||/m||« m 

<E fl^V^ft - /-lli.cn) + A 2 n) llft - /™ll« m + 2Xf\T m g m J m - f m ) Hr , 



,4 

me Jo 



Now, by Young's inequality for positive symmetric operators, we have

A 3 1 m — A 3 A 3 J mA 3 I A3 
(n)



Thus 



■ (n 



(n 



<x 



(n 



<x 



(n 



=x 



in 



(n 



1 qT m + (1 - q)X y 3 



(fm,fm fm/Hm 
\Tm9m> fm ~ fm/H m 

^hm\\n m \\x^Ti(f m -f m )\\ H „ 



\\9*m\\u m \l {fm ~ fm, [QTm + (1 ~ Q)^) f*n ~ fm) 

\\9m\\n m y/q\\f m - /m|l! 2( n) + (1 - q)^ l) \\f m - fm\\ 2 n 
\\9U\n m yJ\\fa- L\\l 3{n) + ^llfm- M& m - 



(22) 



Therefore we have 

z E Ai n Vii/ m iii a(n) + ^ii/,„:: 



meis 



< 



a^'ll^lk, 



ll/m /™llL 2 (n) ~^ ^2 ^ll/m 



with probability $1 - \exp(-t) - \exp(-\zeta_n(r, \lambda))$. The assertion is obvious from this bound. ■
The next theorem immediately gives Theorem 2.

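Theorem 13 Let $\lambda_1^{(n)} = 4\phi_s\eta(t)\xi_n(\lambda)$, $\lambda_2^{(n)} = \lambda$, $\lambda_3^{(n)} = \lambda$ for arbitrary $\lambda > 0$. Then for all $n$ and $r\,(> 0)$ satisfying $\frac{\log(M)}{\sqrt{n}} \le 1$ and the following inequality: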



.(„)!+« 



128max(^^(A), r) [d + £^ =1 \\g m \\^ 

(1 - p(Io) 2 Hl ) 



< 



(23) 



have 



M 



,B) '-* (n)14 * Elicit 



with probability $1 - \exp(-t) - \exp(-\zeta_n(r, \lambda))$ for all $t \ge 1$.
Proof: (Theorem 13) By Eq. (21), we have



ii/- r r 



L 2 (n) 



E^VllZ-Hn + ^ll/JI^ + A^ll/, 



,n \\Hr, 



me in 



J2^ ] \\fm-f 

m&Io 



mWUr. 



n M 



<{\\l-n 2 



L 2 (n) 



11/ - rn») + - E E - 



n—l m— 1 



- E( A i y/\\fm-fi 



f *illn+^2 ll/m /mll% m +-^3 ^(/ml /m fm)m)- 

Here on the event Siir\ the above inequality gives 

n/-/iL(n) + ^ E(M n Vii^iil 2 (n)+ A 2 n) ii/-iik+4 n) ii4iik)+ E 4 B) uA»-/. 



mll« r , 



m£l, 
M 



o 



771 E Jo 



<max(0 sV / n^,r) 



A/ \ " n M 

E Vii-^-^ni a (n) + A ii/--/mii^ m + - E E e *(/U*o - 

'n—l / n—l m— 1 



E/ ( O^l ^ \f\\fm /rnll| 2 (n) + ^2 ^ll/m /mll« m + ^3 ^(/m)/m /to)? 



roEio 



(24) 
Moreover notice that the assumption (f2U[) implies max^^^r) < |. Thus Eq. (|T8|) in Lemma

m holds. 

Step 1. (Bound of the first term in the RHS of Eq. (24)) By Eq. (18) in Lemma 12, the first term on the RHS of Eq. (24) can be upper bounded as

/ M ^ 2 



max(0 sV /< 2 ,r) ^ - /;„|| 2 L2(n) + X\\f m - f*J 2 



£2(11) ^ 'Num Jm\m„ 

1+9 



<ma*fo.VS&r) ( 8 £ ( 1 + ^- ^f" 11 *" ] - / m ||£ 2(n) + A^ll/m - / m ||^, 



/ \( n ' Y^A'i II * 112 \ 

<128max(0.^r)(d+-5 ^ £ (||/ m - /* ||| 2(n) + A^ll/™ - &|& m ) 

A x J melo 

,.), , — ,2 . I 7 1 A 3 ' S m =l lbmll^ m 1 / II/ - / lli 2 (n) , \ ' \(n)n? r* 1 1 2 

<128max(0 s V<„,r)(d+ I ^ fl _ p(m ^ o) + A 2 II/-" Un m 



(25) 

By assumption, we have 128 ^g^p^y^ (d + A ""' - ^jgj l|g;j *™ ^ < ±. Hence the RHS of the 
above inequality is bounded by | - /*||| 2(n) + Eme/o A^ll/m - /mill/ 

Step 2. (Bound of the second term in the RHS of Eq. (24)) By Eq. (18) in Lemma 12, we have on the event $\mathcal{E}_1(t)$

n M M 

-EE ^U^) - < ]T V(t)4>.tny/\\fm ~ / ro |IL ( n) + A ll/- - f™\\n m 

n— 1 m— 1 m— 1 



iW 2 



1 + ^ Jit" v(t)<l>sZ n yJ\\f m - fL\\l 2i u) + 4" } ll/m - U 

1 / 

256^(t) 2 ^ / Xt )1+q v iu* ,,2 
" (l-p(Io)»)«(J ) ^ + ^p - ^ llflJI * 

(WWW v-/|,f .*|,a , A H||f r i|2 

T g 2^ \JI-/ m /mlli 2 (n) + A 2 ll/m Jmlm, 



16 I n 

- (i-p(i rHi ) ' 



l n)2 +Ar )1+9 £ u&nsJ +gfii/-nw E4 n) n/--/-n0 ■ 

m.=l / \ mG/ / 



(26) 



Step 3. (Bound of the third term in the RHS of Eq. (24)) By the Cauchy-Schwarz inequality, we have

E A 1 ^\/ll/ m ~ /mlli 2 (n) + A 2 ^l/m - /mllw™ 

s i2 



-2(l-p(/ ) 2 M/o) 1 8 j^/ /^H^(n)+ A 2 II/- /mlk 

Step 4. (Bound of the last term in the RHS of Eq. (24)) By Eq. (22), we have



2 
^ 8 Em=l \\9* m \\n m , (1 - P(Iq) 2 )k(Iq) \ - /,,? ,, ||2 «||2

-(l-p(/ )2) K (/ ) dA3 + 8 2^l/m-/ m || ia(n) + A 3 HZ™ AJw, 



- (l E ^)Wol ^" )1+g + l " r " L(n) + ^ 4 n) ll/ m -/™llL ) ■ (28) 



Step 5. (Combining all the bounds) Substituting the inequalities (25), (26), (27) and (28) into Eq. (24), we obtain

2 11/ — /*lli 2 (n) 



(l-p(J ) 2 M/o) V dAl +Aa + 2(l-p(/ ) 2 M/o) 1 + (l-p(J ) a MJo) a 

M 
-(l-p(7o) 2 M/o) ,

This gives the assertion. ■ 
E Proof of Theorem 3

Proof: (Theorem 3) The $\delta$-packing number $M(\delta, \mathcal{G}, L_2(P))$ of a function class $\mathcal{G}$ with respect to the $L_2(P)$ norm is the largest number of functions $\{f_1, \dots, f_M\} \subseteq \mathcal{G}$ such that $\|f_i - f_j\|_{L_2(P)} \ge \delta$ for all $i \ne j$. It is easily checked that

M{8/2, G, L 2 (P)) < M(6, G, L 2 {P)) < Af(5, g, L 2 (P)). (29) 
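To make the definition of a $\delta$-packing concrete, the sketch below builds a maximal $\delta$-separated subset of a finite point cloud greedily. Points in the plane with the Euclidean distance stand in for functions with the $L_2(P)$ norm; this substitution, the function name and the sample data are assumptions made for illustration only.

```python
import math
import random


def greedy_packing(points: list[tuple[float, float]], delta: float) -> list[tuple[float, float]]:
    """Greedily select points whose pairwise distances are all at least delta."""
    centers: list[tuple[float, float]] = []
    for p in points:
        # Accept p only if it is delta-separated from every center chosen so far.
        if all(math.dist(p, c) >= delta for c in centers):
            centers.append(p)
    return centers


if __name__ == "__main__":
    rng = random.Random(0)
    cloud = [(rng.random(), rng.random()) for _ in range(500)]
    centers = greedy_packing(cloud, 0.2)
    print(len(centers))
```

Every rejected point lies within distance less than $\delta$ of some selected center, so a maximal $\delta$-separated set is also a $\delta$-cover of the point cloud. This is the standard mechanism by which packing and covering numbers are compared.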
First we give the assertion about the $\ell_\infty$-mixed-norm ball (Eq. (10)). To simplify the notation, set $R = R_\infty$. For given $\delta_n > 0$ and $\epsilon_n > 0$, let $Q$ be the $\delta_n$-packing number $M(\delta_n, \mathcal{H}_{\ell_\infty}^{d,q}(R), L_2(\Pi))$ of $\mathcal{H}_{\ell_\infty}^{d,q}(R)$ and $N$ be the $\epsilon_n$-covering number $\mathcal{N}(\epsilon_n, \mathcal{H}_{\ell_\infty}^{d,q}(R), L_2(\Pi))$ of $\mathcal{H}_{\ell_\infty}^{d,q}(R)$. Raskutti et al. (2010) utilized the techniques developed by Yang and Barron (1999) to show the following inequality in their proof of Theorem 2(b):

inf sup E[||/-r||| 2(n) ] >inf sup §■ P[\\f - f llL(n) > #/2] 

> *l( 1 _ log(AQ + ^e 2 +log(2) 
" 2 ^ log(Q) 

Now let $Q_m := M\left(\delta_n/\sqrt{d}, \mathcal{H}_m^q(R), L_2(\Pi)\right)$ (recall the definition of $\mathcal{H}_m^q(R)$ (Eq. (11)); since $\mathcal{H}_m$ is now taken as $\mathcal{H}$ for all $m$, the value $Q_m$ is common for all $m$). Thus by taking $\delta_n$ and $\epsilon_n$ to satisfy

^e 2 < log(JV), (30) 

41og(JV) <log(Q), (31) 

the minimax rate is lower bounded by In Lemma 5 of IRaskutti et all (|2010l ). it is shown that if 
Qi > 2 and d < M/4, we have 

tog(Q)~d!og((5i) + dlog^ . 

By the estimation of the covering number of "H^(l) (Eq. (|12p). the strong spectrum assumption 
(Eq. ©) and the relation we have 

Thus the conditions (31) and (30) are satisfied if we set $\delta_n = C\epsilon_n$ with an appropriately chosen constant $C$ and we take $\epsilon_n$ so that the following inequality holds:

ns 2 n <d 1+s R 2S e- 2S + dlog(^ 
It suffices to take

i 2, d\og(% 

,-T+I RT+S -I 2_liL 



l~dn-TTiR& + "'" 6u; . (32) 
n 



Note that we have taken $R \ge \sqrt{\frac{\log(M/d)}{n}}$; thus $Q_m \ge 2$ is satisfied if we take the constant in Eq. (32) appropriately. Thus we obtain the assertion (10).

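Next we give the assertion about the $\ell_2$-mixed-norm ball (Eq. (9)). To simplify the notation, set $R = R_2$. Since $\mathcal{H}_{\ell_2}^{d,q}(R) \supseteq \mathcal{H}_{\ell_\infty}^{d,q}(R/\sqrt{d})$, we obtain

$$
\inf_{\hat{f}} \sup_{f^* \in \mathcal{H}_{\ell_2}^{d,q}(R)} \mathrm{E}\left[\|\hat{f} - f^*\|_{L_2(\Pi)}^2\right]
\ge \inf_{\hat{f}} \sup_{f^* \in \mathcal{H}_{\ell_\infty}^{d,q}(R/\sqrt{d})} \mathrm{E}\left[\|\hat{f} - f^*\|_{L_2(\Pi)}^2\right].
$$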



Here notice that we have $\frac{R}{\sqrt{d}} \ge \sqrt{\frac{\log(M/d)}{n}}$ by assumption. Thus we can apply the assertion about the $\ell_\infty$-mixed-norm ball (10) to bound the RHS of the display just above. We have shown that

— d\os(— ) 

inf sup E[||/ - r|| 2 L2(n) ] > dn-rh(R/Vd)^ + &{d > 

f f£-Hf q (R,/Vd) 71 



d — n - — R— + 2A_d_2 



This gives the assertion (9).